
Conversation

@sarthakaggarwal97 (Contributor) commented Oct 9, 2025

Resolves #2696

The primary issue was that, in sanitizer mode, the test needed more time for the primary’s replication buffers to grow beyond 2 × backlog_size. Increasing repl-timeout to 30s ensures that the inactive replica is not disconnected while the full sync is in progress. rdb-key-save-delay throttles the rate at which data is written to the client output buffer; with it, we can deterministically complete the full sync within 10s (10000 keys × 0.001s per key).
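
For illustration, a minimal sketch of what these overrides could look like in the Tcl test harness (`start_server`, `r`, and `debug populate` are the suite's existing helpers; the exact test body in the repo may differ):

```tcl
start_server {tags {"repl"}} {
    # Keep the inactive replica connected while the full sync runs.
    r config set repl-timeout 30
    # Delay each key written during the RDB save by 1000 us (0.001 s),
    # so 10000 keys take ~10 s: slow enough for the output buffer to
    # grow past 2x backlog_size, fast enough to finish within repl-timeout.
    r config set rdb-key-save-delay 1000
    r debug populate 10000
}
```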

Increasing the wait_for_condition retry count gives the test enough retries to verify that mem_total_replication_buffers reaches the required 2 × backlog_size.
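
As a sketch, the retry loop might look roughly like this (the retry counts are illustrative, `$backlog_size` is assumed to hold the configured repl-backlog-size, and `[s <field>]` is the suite's helper for reading an INFO field):

```tcl
# Up to 500 retries, 100 ms apart, for the buffers to grow.
wait_for_condition 500 100 {
    [s mem_total_replication_buffers] > 2 * $backlog_size
} else {
    fail "Buffers did not reach 2x backlog_size,\
        current: [s mem_total_replication_buffers]"
}
```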

The test has passed the past 7 consecutive iterations of test-sanitizer-address in my daily runs. I amended the log message to show the current buffer size if it doesn't reach 2 × backlog_size.

Signed-off-by: Sarthak Aggarwal <[email protected]>
codecov bot commented Oct 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.62%. Comparing base (155b0bb) to head (db6e772).
⚠️ Report is 13 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2715      +/-   ##
============================================
+ Coverage     72.40%   72.62%   +0.21%     
============================================
  Files           128      128              
  Lines         71273    71273              
============================================
+ Hits          51606    51759     +153     
+ Misses        19667    19514     -153     

see 19 files with indirect coverage changes


@zuiderkwast (Contributor) left a comment

LGTM

@enjoy-binbin (Member) left a comment

Can you give some clues? That is, how (or where) did you determine the timeout, and why did increasing rdb-key-save-delay help? We can add more info to the top comment and then merge it. The analysis is also very useful for these timing issues.

@sarthakaggarwal97 (Contributor, Author)

@enjoy-binbin thank you for taking a look. I shared my thought process in the PR description, please let me know if it doesn't make sense!

@enjoy-binbin enjoy-binbin merged commit 981b8fe into valkey-io:unstable Oct 16, 2025
52 checks passed
diego-ciciani01 pushed a commit to diego-ciciani01/valkey that referenced this pull request Oct 21, 2025


Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] Primary COB growth with inactive replica
